This script takes a deep dive into the Landsat 5 labels for a more rigorous analysis of inconsistent band data and outliers in the filtered label dataset. Here we determine whether any more label data points should be removed from the training dataset, and whether we can glean anything from the metadata in the outlier dataset that would let us pre-emptively toss out scenes when we go to apply the classification algorithm.
harmonize_version = "2024-04-25"
outlier_version = "2024-04-25"
LS5 <- read_rds(paste0("data/labels/harmonized_LS57_labels_", harmonize_version, ".RDS")) %>%
  filter(mission == "LANDSAT_5")
Just look at the data to see whether the user-pulled data and our re-pull are consistent. Here, the user data are in “BX” format and the re-pull is in “SR_BX” format. These steps assure data quality in case a volunteer didn’t follow the directions explicitly, or in case there are differences in the re-pull. The re-pull masks out saturated pixels, so any instance where an “SR_BX” value is NA indicates that the pixel was saturated in at least one band.
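The comparison plots were rendered from a hidden chunk; a minimal sketch of what such a plot could look like (assuming the tidyverse is loaded and `LS5` has the paired `B*`/`SR_B*` columns read in above; `plot_band_comparison` is a hypothetical helper, not part of the original workflow):

```r
library(tidyverse)

# Hypothetical helper: one scatter of user pull vs re-pull per band.
# Points off the 1:1 line (or missing SR_B* values) flag inconsistencies
# or saturation masking.
plot_band_comparison <- function(df,
                                 bands = c("B1", "B2", "B3", "B4", "B5", "B7")) {
  map(bands, function(b) {
    ggplot(df, aes(x = .data[[b]], y = .data[[paste0("SR_", b)]])) +
      geom_point(alpha = 0.3) +
      # 1:1 reference line: a perfect match between pulls falls on it
      geom_abline(slope = 1, intercept = 0, color = "red") +
      labs(x = paste0(b, " (user pull)"),
           y = paste0("SR_", b, " (re-pull)")) +
      theme_bw()
  })
}

# e.g. plot_band_comparison(LS5)
```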
There is some mismatch here. Let’s filter to any labels where at least one band value doesn’t match between the user pull and the re-pull.
LS5_inconsistent <- LS5 %>%
  filter(is.na(SR_B7) | B1 != SR_B1 | B2 != SR_B2 | B3 != SR_B3 |
           B4 != SR_B4 | B5 != SR_B5 | B7 != SR_B7)

LS5_inconsistent %>%
  group_by(class) %>%
  summarise(n_labels = n()) %>%
  kable()
| class | n_labels |
|---|---|
| cloud | 214 |
| darkNearShoreSediment | 1 |
| lightNearShoreSediment | 3 |
| openWater | 9 |
| other | 1 |
| shorelineContamination | 3 |
Most of these are cloud labels, where the pixel is saturated and therefore masked in the re-pull (resulting in an NA). Let’s drop those from this subset and look further.
LS5_inconsistent <- LS5_inconsistent %>%
  filter(!is.na(SR_B7))
This leaves 0.6% of the Landsat 5 labels as inconsistent. Let’s do a quick sanity check to make sure that we’ve dropped labels that are inconsistent between pulls, as well as any where a re-pulled band value is greater than 1:
LS5_filtered <- LS5 %>%
  filter(# keep labels where the re-pull data and user data match
         (B1 == SR_B1 & B2 == SR_B2 & B3 == SR_B3 &
            B4 == SR_B4 & B5 == SR_B5 & B7 == SR_B7),
         # and where no re-pulled band value is greater than 1,
         # which isn't a valid surface reflectance value
         if_all(all_of(LS57_ee),
                ~ . <= 1))
And plot:
And now let’s look at the data by class:
We aren’t actually modeling “other” (not sufficient observations to classify) or “shorelineContamination” (we’ll use this later to block areas where there is likely shoreline contamination in the AOI), so let’s drop those categories and look at the data again. We’ll also drop the B1-B7 columns here.
LS5_for_class_analysis <- LS5_filtered %>%
  filter(!(class %in% c("other", "shorelineContamination"))) %>%
  select(-c(B1:B7))
Let’s also go back and check to see if there is any pattern to the inconsistent labels.
| vol_init | n_tot_labs | n_dates |
|---|---|---|
| HAD | 6 | 4 |
| BGS | 5 | 2 |
| LRCP | 3 | 2 |
| AMP | 1 | 1 |
There are just a few inconsistencies here, spread across multiple dates. This could just be a processing difference (if there happened to be an update to a scene since users pulled these data, or if these labels were on an overlapping portion of two scenes). I’m not concerned about any systemic errors here that might require modified data handling for a specific scene or contributor.
There are statistical outliers within this dataset, and they may impact the interpretation of any statistical testing we do. Let’s see if we can narrow down where those outliers occur and/or glean anything from the outlier data that may be applicable to the application of the algorithm. Outliers may be a systemic issue (the scene as a whole is an outlier), a user issue (a user may have been a bad actor), or they might just be real. This section asks those questions. The “true outliers” that we dismiss from the dataset will also be used to aid interpretation/application of the algorithm across the Landsat stack, so it is important to note any patterns we see in the outlier dataset.
## [1] "Classes represented in outliers:"
## [1] "darkNearShoreSediment" "lightNearShoreSediment" "offShoreSediment"
## [4] "openWater"
Okay, there are 84 extreme outliers (>1.5*IQR) out of 1625 total labels across non-cloud classes.
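The 1.5*IQR rule referenced above can be sketched with a hypothetical helper (the chunk that produced these counts is not shown; grouping by class and applying the rule per band is an assumption):

```r
library(tidyverse)

# Flag extreme outliers: values outside [Q1 - k*IQR, Q3 + k*IQR].
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# e.g., applied per class and band:
# LS5_for_class_analysis %>%
#   group_by(class) %>%
#   mutate(across(starts_with("SR_B"), flag_iqr_outliers, .names = "out_{.col}"))
```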
Are there any contributors that show up more than others in the outliers dataset?
| vol_init | n_tot | n_out | percent_outlier |
|---|---|---|---|
| BGS | 777 | 59 | 7.6 |
| AMP | 73 | 5 | 6.8 |
| LRCP | 181 | 8 | 4.4 |
| KRW | 49 | 2 | 4.1 |
| HAD | 432 | 10 | 2.3 |
| FYC | 98 | NA | NA |
| LAE | 15 | NA | NA |
These aren’t terrible. All below 10%, nothing egregious.
How many of these outliers are in specific scenes? For the purposes of comparison, we’ll drop the cloud groups.
| date | vol_init | n_out | n_tot | percent_outlier |
|---|---|---|---|---|
| 1988-11-23 | BGS | 28 | 67 | 41.8 |
| 1993-06-30 | HAD | 9 | 24 | 37.5 |
| 1998-10-02 | BGS | 15 | 98 | 15.3 |
| 2010-09-01 | AMP | 5 | 58 | 8.6 |
| 1998-08-31 | LRCP | 8 | 115 | 7.0 |
| 2000-08-04 | BGS | 5 | 74 | 6.8 |
| 1984-07-07 | KRW | 2 | 32 | 6.2 |
| 2006-07-04 | BGS | 3 | 58 | 5.2 |
| 2009-04-23 | BGS | 2 | 53 | 3.8 |
| 2004-08-15 | BGS | 2 | 55 | 3.6 |
| 2006-06-02 | BGS | 3 | 105 | 2.9 |
| 2007-05-04 | HAD | 1 | 57 | 1.8 |
| 2010-04-10 | BGS | 1 | 79 | 1.3 |
There are two scenes here that have a higher proportion of outliers (>20% of total labels) - perhaps there is something about the atmospheric correction in these particular scenes? Or the general scene quality? Let’s look at the scene-level metadata:
| date | vol_init | DATA_SOURCE_AIR_TEMPERATURE | DATA_SOURCE_ELEVATION | DATA_SOURCE_OZONE | DATA_SOURCE_PRESSURE | DATA_SOURCE_REANALYSIS | DATA_SOURCE_WATER_VAPOR | SENSOR_MODE_SLC | CLOUD_COVER_list | IMAGE_QUALITY_list | mean_cloud_cover | max_cloud_cover |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1988-11-23 | BGS | NCEP | GLS2000 | TOMS | NCEP | MERRA-2 | NCEP | ON | 43; 3 | 9 | 23.0 | 43 |
| 1993-06-30 | HAD | NCEP | GLS2000 | TOMS | NCEP | MERRA-2 | NCEP | ON | 78; 63 | 9; 7 | 70.5 | 78 |
This isn’t terribly helpful; we still need to dig a little more.
How many bands are outliers when the data are aggregated back to the label level? If a large portion of the outliers are in the RGB bands (how users labeled data), there is probably a systemic problem. If the outliers are in singular bands, especially those not in the visible spectrum, we can dismiss the individual observations and probably assert that the scene as a whole is okay to use in training. As a first pass, for labels where 3 or more bands are deemed outliers, let’s look at which bands are outliers:
| date | class | vol_init | user_label_id | n_bands_out | bands_out |
|---|---|---|---|---|---|
| 1988-11-23 | openWater | BGS | 1932 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1933 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1934 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1935 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1936 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1938 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1941 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1942 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1943 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1944 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1945 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1947 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1948 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1949 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1950 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 1988-11-23 | openWater | BGS | 1927 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 1928 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 1929 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 1931 | 3 | SR_B1; SR_B2; SR_B4 |
| 1988-11-23 | openWater | BGS | 1937 | 3 | SR_B1; SR_B2; SR_B3 |
| 1988-11-23 | openWater | BGS | 1946 | 3 | SR_B1; SR_B2; SR_B4 |
| 1998-08-31 | darkNearShoreSediment | LRCP | 847 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-08-31 | darkNearShoreSediment | LRCP | 938 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-10-02 | offShoreSediment | BGS | 1434 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-10-02 | offShoreSediment | BGS | 1435 | 3 | SR_B4; SR_B5; SR_B7 |
| 1998-10-02 | offShoreSediment | BGS | 1436 | 3 | SR_B4; SR_B5; SR_B7 |
| 2006-06-02 | lightNearShoreSediment | BGS | 2146 | 3 | SR_B4; SR_B5; SR_B7 |
| 2006-07-04 | lightNearShoreSediment | BGS | 140 | 3 | SR_B4; SR_B5; SR_B7 |
| 2006-07-04 | lightNearShoreSediment | BGS | 157 | 3 | SR_B4; SR_B5; SR_B7 |
| 2007-05-04 | darkNearShoreSediment | HAD | 3354 | 3 | SR_B4; SR_B5; SR_B7 |
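The per-label tally in the table above can be reproduced with a sketch like the following (assuming the outlier flags are available in long format; the column names `band` and `is_outlier` and the helper name are illustrative):

```r
library(tidyverse)

# Count outlier bands per label and keep labels with >= 3 flagged bands.
count_outlier_bands <- function(outliers_long) {
  outliers_long %>%
    filter(is_outlier) %>%
    group_by(date, class, vol_init, user_label_id) %>%
    summarise(n_bands_out = n(),
              bands_out = paste(band, collapse = "; "),
              .groups = "drop") %>%
    filter(n_bands_out >= 3) %>%
    arrange(desc(n_bands_out))
}
```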
Let’s group by image date and volunteer and tally up the number of labels where at least 3 bands were outliers:
| date | vol_init | n_labels |
|---|---|---|
| 1988-11-23 | BGS | 21 |
| 1998-10-02 | BGS | 3 |
| 1998-08-31 | LRCP | 2 |
| 2006-07-04 | BGS | 2 |
| 2006-06-02 | BGS | 1 |
| 2007-05-04 | HAD | 1 |
At the very least, there are issues in the 1988-11-23 scene… but is that user error? Atmospheric correction?
This looks super hazy, so that may be the issue with this particular scene. Unfortunately, there is no way to know this from the scene metadata. Let’s see if the SR_ATMOS_OPACITY value is helpful at all:
outliers %>%
filter(date == "1988-11-23") %>%
pluck("SR_ATMOS_OPACITY") %>%
range(., na.rm = T)
## [1] 0.166 0.170
These values are pretty low (higher values indicate hazy conditions); let’s look at cirrus confidence across these labels, too:
outliers %>%
filter(date == "1988-11-23") %>%
pluck("cirrus_conf") %>%
range(., na.rm = T)
## [1] 0 0
Nothing here. How about cloud_conf?
outliers %>%
filter(date == "1988-11-23") %>%
pluck("cloud_conf") %>%
range(., na.rm = T)
## [1] 1 1
This is low confidence (bit value of 1). How about dilated clouds?
outliers %>%
filter(date == "1988-11-23") %>%
pluck("dialated_cloud") %>%
range(., na.rm = T)
## [1] 0 0
Nope.
While there doesn’t seem to be a systematic way of detecting the haze issue in this scene (or with the labeled pixels), it is clear that there is an issue with this scene. It’s also possible that the surrounding bright snow causes issues with the Rrs calculation.
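For context, the confidence fields used throughout (`cloud_conf`, `cloudshad_conf`, `snowice_conf`, `cirrus_conf`) are two-bit values packed into the Collection 2 `QA_PIXEL` band. A sketch of how they can be decoded (bit positions per the USGS Collection 2 Level-2 product description; worth verifying against the product guide for your sensor):

```r
# Extract an n-bit field starting at start_bit from a packed QA integer.
extract_qa_bits <- function(qa, start_bit, n_bits = 2) {
  bitwAnd(bitwShiftR(as.integer(qa), start_bit), 2^n_bits - 1)
}

# Per the Collection 2 QA_PIXEL layout (an assumption to verify):
# cloud confidence: bits 8-9; cloud shadow: 10-11; snow/ice: 12-13;
# cirrus: 14-15; dilated cloud is a single flag at bit 1.
# e.g. cloud_conf <- extract_qa_bits(qa_pixel, 8)
#      dialated_cloud <- extract_qa_bits(qa_pixel, 1, n_bits = 1)
```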
Do any of the labels have QA pixel indications of contamination? Let’s see if the medium certainty classification in the QA band is useful here:
LS5_for_class_analysis %>%
  mutate(QA = case_when(cirrus_conf >= 2 ~ "cirrus",
                        snowice_conf >= 2 ~ "snow/ice",
                        cloudshad_conf >= 2 ~ "cloud shadow",
                        cloud_conf >= 2 ~ "cloud",
                        TRUE ~ "clear")) %>%
  group_by(QA) %>%
  filter(class != "cloud") %>%
  summarize(n_tot = n()) %>%
  kable()
| QA | n_tot |
|---|---|
| clear | 1145 |
| cloud | 9 |
| cloud shadow | 62 |
| snow/ice | 11 |
And how about high certainty:
LS5_for_class_analysis %>%
  mutate(QA = case_when(cirrus_conf >= 3 ~ "cirrus",
                        snowice_conf >= 3 ~ "snow/ice",
                        cloudshad_conf >= 3 ~ "cloud shadow",
                        cloud_conf >= 3 ~ "cloud",
                        TRUE ~ "clear")) %>%
  group_by(QA) %>%
  filter(class != "cloud") %>%
  summarize(n_tot = n()) %>%
  kable()
| QA | n_tot |
|---|---|
| clear | 1147 |
| cloud | 7 |
| cloud shadow | 62 |
| snow/ice | 11 |
What about low confidence?
LS5_for_class_analysis %>%
  mutate(QA = case_when(snowice_conf >= 1 ~ "snow/ice",
                        cloudshad_conf >= 1 ~ "cloud shadow",
                        cirrus_conf >= 1 ~ "cirrus",
                        cloud_conf >= 1 ~ "cloud",
                        TRUE ~ "clear")) %>%
  group_by(QA) %>%
  filter(class != "cloud") %>%
  summarize(n_tot = n()) %>%
  kable()
| QA | n_tot |
|---|---|
| snow/ice | 1227 |
Low confidence is NOT helpful! Let’s move forward with medium confidence and look at the flagged data from all classes except cloud:
LS5_qa_flagged <- LS5_for_class_analysis %>%
  filter((cirrus_conf >= 2 |
            snowice_conf >= 2 |
            cloudshad_conf >= 2 |
            cloud_conf >= 2),
         class != "cloud") %>%
  group_by(date, vol_init) %>%
  summarise(n_qa_flagged = n()) %>%
  arrange(-n_qa_flagged)

LS5_tot <- LS5_for_class_analysis %>%
  group_by(date, vol_init) %>%
  filter(class != "cloud") %>%
  summarise(n_tot_labels = n())

left_join(LS5_qa_flagged, LS5_tot) %>%
  mutate(percent_qa_flagged = round(n_qa_flagged/n_tot_labels*100, 1)) %>%
  arrange(-percent_qa_flagged) %>%
  kable()
| date | vol_init | n_qa_flagged | n_tot_labels | percent_qa_flagged |
|---|---|---|---|---|
| 1987-11-05 | HAD | 28 | 41 | 68.3 |
| 1993-06-30 | HAD | 11 | 24 | 45.8 |
| 1990-06-06 | HAD | 8 | 25 | 32.0 |
| 1998-10-02 | BGS | 24 | 98 | 24.5 |
| 2007-05-04 | HAD | 4 | 57 | 7.0 |
| 2004-08-15 | BGS | 3 | 55 | 5.5 |
| 2009-04-23 | BGS | 1 | 53 | 1.9 |
| 2011-10-06 | HAD | 2 | 111 | 1.8 |
| 1988-11-23 | BGS | 1 | 67 | 1.5 |
We don’t want to be using data that has QA flags for any pixel that isn’t labeled cloud. Let’s look at the four images with >20% QA flag labels:
1987-11-05
LS5_for_class_analysis %>%
  filter(date == "1987-11-05", class != "cloud",
         (cirrus_conf >= 2 |
            snowice_conf >= 2 |
            cloudshad_conf >= 2 |
            cloud_conf >= 2)) %>%
  select(class, SR_ATMOS_OPACITY, cirrus_conf, cloud_conf, cloudshad_conf, snowice_conf) %>%
  kable()
| class | SR_ATMOS_OPACITY | cirrus_conf | cloud_conf | cloudshad_conf | snowice_conf |
|---|---|---|---|---|---|
| openWater | 0.080 | 0 | 1 | 3 | 1 |
| openWater | 0.062 | 0 | 1 | 3 | 1 |
| openWater | 0.055 | 0 | 1 | 3 | 1 |
| openWater | 0.059 | 0 | 1 | 3 | 1 |
| openWater | 0.066 | 0 | 1 | 3 | 1 |
| openWater | 0.054 | 0 | 1 | 3 | 1 |
| openWater | 0.071 | 0 | 1 | 3 | 1 |
| openWater | 0.059 | 0 | 1 | 3 | 1 |
| openWater | 0.063 | 0 | 1 | 3 | 1 |
| openWater | 0.087 | 0 | 1 | 3 | 1 |
| openWater | 0.072 | 0 | 1 | 3 | 1 |
| openWater | 0.071 | 0 | 1 | 3 | 1 |
| openWater | 0.072 | 0 | 1 | 3 | 1 |
| openWater | 0.074 | 0 | 1 | 3 | 1 |
| openWater | 0.080 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.073 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.074 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.079 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.042 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.034 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.037 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.041 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.075 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.075 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.025 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.026 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.026 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.034 | 0 | 1 | 3 | 1 |
These are entirely cloud shadow flags, which makes sense given the extent and pattern of clouds. This scene does not seem contaminated with haze, which is confirmed by SR_ATMOS_OPACITY values < 0.1.
1993-06-30
LS5_for_class_analysis %>%
  filter(date == "1993-06-30", class != "cloud",
         (cirrus_conf >= 2 |
            snowice_conf >= 2 |
            cloudshad_conf >= 2 |
            cloud_conf >= 2)) %>%
  select(class, SR_ATMOS_OPACITY, cirrus_conf, cloud_conf, cloudshad_conf, snowice_conf) %>%
  kable()
| class | SR_ATMOS_OPACITY | cirrus_conf | cloud_conf | cloudshad_conf | snowice_conf |
|---|---|---|---|---|---|
| openWater | 0.663 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.379 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.373 | 0 | 3 | 1 | 1 |
| darkNearShoreSediment | 0.439 | 0 | 1 | 1 | 3 |
| darkNearShoreSediment | 0.442 | 0 | 1 | 1 | 3 |
| darkNearShoreSediment | 0.415 | 0 | 1 | 1 | 3 |
| darkNearShoreSediment | 0.413 | 0 | 1 | 1 | 3 |
| darkNearShoreSediment | 0.422 | 0 | 1 | 1 | 3 |
| darkNearShoreSediment | 0.442 | 0 | 1 | 1 | 3 |
| darkNearShoreSediment | 0.422 | 0 | 1 | 1 | 3 |
| darkNearShoreSediment | 0.411 | 0 | 1 | 1 | 3 |
Most of these are high opacity values, indicating some cirrus-type contamination. Additionally, there is some misclassification of snow and ice here. As indicated by the opacity values and the QA flags, these pixels should be dropped. SR_ATMOS_OPACITY values greater than 0.3 indicate haze or other cloud presence.
LS5_for_class_analysis %>%
filter(date == "1993-06-30", class != "cloud") %>%
pluck("SR_ATMOS_OPACITY") %>%
range()
## [1] 0.373 0.663
1990-06-06
LS5_for_class_analysis %>%
  filter(date == "1990-06-06", class != "cloud",
         (cirrus_conf >= 2 |
            snowice_conf >= 2 |
            cloudshad_conf >= 2 |
            cloud_conf >= 2)) %>%
  select(class, SR_ATMOS_OPACITY, cirrus_conf, cloud_conf, cloudshad_conf, snowice_conf) %>%
  kable()
| class | SR_ATMOS_OPACITY | cirrus_conf | cloud_conf | cloudshad_conf | snowice_conf |
|---|---|---|---|---|---|
| openWater | 0.267 | 0 | 1 | 3 | 1 |
| openWater | 0.268 | 0 | 1 | 3 | 1 |
| openWater | 0.236 | 0 | 1 | 3 | 1 |
| openWater | 0.241 | 0 | 1 | 3 | 1 |
| openWater | 0.080 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.391 | 0 | 3 | 1 | 1 |
| lightNearShoreSediment | 0.388 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.264 | 0 | 1 | 3 | 1 |
Mostly cloud and cloud shadow here. Atmospheric opacity is a bit lower, but still some that will need to be removed from the training dataset.
1998-10-02
LS5_for_class_analysis %>%
  filter(date == "1998-10-02", class != "cloud",
         (cirrus_conf >= 2 |
            snowice_conf >= 2 |
            cloudshad_conf >= 2 |
            cloud_conf >= 2)) %>%
  select(class, SR_ATMOS_OPACITY, cirrus_conf, cloud_conf, cloudshad_conf, snowice_conf) %>%
  kable()
| class | SR_ATMOS_OPACITY | cirrus_conf | cloud_conf | cloudshad_conf | snowice_conf |
|---|---|---|---|---|---|
| openWater | 0.270 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.284 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.286 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.251 | 0 | 1 | 3 | 1 |
| lightNearShoreSediment | 0.241 | 0 | 1 | 1 | 3 |
| lightNearShoreSediment | 0.295 | 0 | 3 | 3 | 1 |
| lightNearShoreSediment | 0.295 | 0 | 3 | 3 | 1 |
| lightNearShoreSediment | 0.296 | 0 | 2 | 1 | 1 |
| lightNearShoreSediment | 0.296 | 0 | 2 | 1 | 3 |
| darkNearShoreSediment | 0.213 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.213 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.213 | 0 | 3 | 3 | 1 |
| darkNearShoreSediment | 0.213 | 0 | 1 | 3 | 1 |
| offShoreSediment | 0.219 | 0 | 1 | 3 | 1 |
| offShoreSediment | 0.278 | 0 | 1 | 3 | 1 |
| offShoreSediment | 0.271 | 0 | 1 | 3 | 1 |
| offShoreSediment | 0.251 | 0 | 3 | 1 | 1 |
| offShoreSediment | 0.282 | 0 | 2 | 3 | 1 |
| offShoreSediment | 0.236 | 0 | 1 | 3 | 1 |
| offShoreSediment | 0.248 | 0 | 3 | 1 | 1 |
| offShoreSediment | 0.218 | 0 | 3 | 1 | 1 |
| offShoreSediment | 0.275 | 0 | 3 | 1 | 1 |
| offShoreSediment | 0.293 | 0 | 2 | 1 | 1 |
| offShoreSediment | 0.293 | 0 | 3 | 1 | 1 |
It’s clear there is a bit of haze in this image, and that is confirmed by the opacity values, though the opacity is still below the 0.3 cutoff. Using the QA bits here to remove medium- and high-confidence QA flags should resolve any contamination. Let’s also look at the range of atmospheric opacity across all labels in this scene:
LS5_for_class_analysis %>%
filter(date == "1998-10-02", class != "cloud") %>%
pluck("SR_ATMOS_OPACITY") %>%
range()
## [1] 0.034 0.296
All below the 0.3 threshold.
And now for the outliers:
outliers %>%
  mutate(QA = case_when(snowice_conf >= 2 ~ "snow/ice",
                        cloudshad_conf >= 2 ~ "cloud shadow",
                        cirrus_conf >= 2 ~ "cirrus",
                        cloud_conf >= 2 ~ "cloud",
                        TRUE ~ "clear")) %>%
  group_by(QA) %>%
  filter(class != "cloud") %>%
  summarize(n_out_tot = n()) %>%
  kable()
| QA | n_out_tot |
|---|---|
| clear | 67 |
| cloud | 7 |
| cloud shadow | 4 |
| snow/ice | 6 |
And let’s look at atmospheric opacity:
outliers %>%
filter(class != "cloud") %>%
pluck("SR_ATMOS_OPACITY") %>%
range(na.rm = T)
## [1] 0.027 0.498
How many of these outliers have near-pixel clouds (as measured by ST_CDIST)?
There are 12 labels (14.3% of outliers) that aren’t “cloud” in the outlier dataset with a cloud distance <500 m, and 126 labels (7.8%) in the whole dataset with a cloud distance <500 m. Since these proportions are roughly the same (or at least not severely disproportionate), I don’t think this is terribly helpful.
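The proportions above can be computed with a sketch like this (assuming `ST_CDIST` has already been converted to meters, as the text implies; `prop_near_cloud` is a hypothetical helper):

```r
library(tidyverse)

# Share of non-cloud labels closer than dist_m to the nearest cloud.
prop_near_cloud <- function(df, dist_m = 500) {
  df %>%
    filter(class != "cloud") %>%
    summarise(n_near = sum(ST_CDIST < dist_m, na.rm = TRUE),
              percent = round(100 * n_near / dplyr::n(), 1))
}

# e.g. prop_near_cloud(outliers); prop_near_cloud(LS5_for_class_analysis)
```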
How many of the outliers have high cloud cover, as reported by the scene-level metadata? Note, we don’t have the direct scene cloud cover associated with individual labels, rather a list of the scene level cloud cover values associated with the AOI.
The outlier dataset contains 9 labels (10.7%) where the max cloud cover was >75% and 9 (10.7%) where the mean cloud cover was >50%. The filtered dataset contains 24 (1.5%) where the max was >75% and 47 (2.9%) where the mean cloud cover was >50%. Neither of these metrics seems helpful for eliminating labels from outlier scenes.
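The `mean_cloud_cover` and `max_cloud_cover` columns in the metadata table earlier can be derived from the semicolon-delimited `CLOUD_COVER_list` strings; a sketch:

```r
library(tidyverse)

# Parse "43; 3"-style strings into per-scene mean and max cloud cover.
summarise_cloud_cover <- function(cc_list) {
  vals <- map(strsplit(cc_list, ";\\s*"), as.numeric)
  tibble(mean_cloud_cover = map_dbl(vals, mean),
         max_cloud_cover = map_dbl(vals, max))
}

# e.g. summarise_cloud_cover("43; 3")
```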
Based on the above review, any label with an SR_ATMOS_OPACITY value >= 0.3, or a QA flag for clouds, cloud shadow, or snow/ice, should be eliminated from this dataset.
LS5_training_labels <- LS5_for_class_analysis %>%
  filter(class == "cloud" |
           (SR_ATMOS_OPACITY < 0.3 &
              cloud_conf < 2 &
              cloudshad_conf < 2 &
              snowice_conf < 2))
We do want to have an idea of how different the classes are, in regards to band data. While there are a bunch of interactions that we could get into here, for the sake of this analysis, we are going to analyze the class differences by band.
Kruskal-Wallis assumptions:

- observations are independent
- the response variable is ordinal or continuous
- the distributions across groups have similar shapes

ANOVA assumptions:

- observations are independent
- residuals are approximately normally distributed
- variance is similar across groups

We can’t entirely assert sample independence, and we know that variance and distribution are different for “cloud” labels, but those data are also visibly different from the other classes.

In order to systematically test for differences between classes and be able to interpret the data, we need to know whether these assumptions hold for our data.
With this workflow, most classes are statistically different - below are the cases where the pairwise comparison were not deemed statistically significant:
## # A tibble: 10 × 9
## band group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 SR_B1 darkNearShore… offSh… 135 273 2.77 0.00552 0.0552 ns
## 2 SR_B2 darkNearShore… offSh… 135 273 0.0529 0.958 1 ns
## 3 SR_B4 darkNearShore… light… 135 257 2.12 0.0338 0.338 ns
## 4 SR_B5 darkNearShore… light… 135 257 2.31 0.0211 0.211 ns
## 5 SR_B5 darkNearShore… offSh… 135 273 -1.72 0.0861 0.861 ns
## 6 SR_B5 offShoreSedim… openW… 273 387 -2.54 0.0111 0.111 ns
## 7 SR_B7 darkNearShore… light… 135 257 2.00 0.0458 0.458 ns
## 8 SR_B7 darkNearShore… offSh… 135 273 -1.36 0.174 1 ns
## 9 SR_B7 darkNearShore… openW… 135 387 -2.38 0.0172 0.172 ns
## 10 SR_B7 offShoreSedim… openW… 273 387 -1.20 0.229 1 ns
There is some consistency here: “darkNearShoreSediment” is often not different from other sediment types by band. It is entirely possible that band interactions overpower these non-significant differences.
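The pairwise output above matches the shape of a Dunn post-hoc test; a sketch of such a workflow using rstatix (an assumption, as the actual chunk is not shown; the Bonferroni adjustment matches the tenfold difference between `p` and `p.adj` in the table):

```r
library(tidyverse)
library(rstatix)

# Per band: long format, Dunn pairwise comparisons of classes,
# then keep only the non-significant pairs.
pairwise_by_band <- function(df,
                             bands = c("SR_B1", "SR_B2", "SR_B3",
                                       "SR_B4", "SR_B5", "SR_B7")) {
  df %>%
    pivot_longer(all_of(bands), names_to = "band", values_to = "value") %>%
    group_by(band) %>%
    dunn_test(value ~ class, p.adjust.method = "bonferroni") %>%
    filter(p.adj.signif == "ns")
}
```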
Let’s look at the boxplots for the training labels, dropping the cloud labels to see the ranges better:
LS5_training_labels_no_clouds <- LS5_training_labels %>%
  filter(class != "cloud")

pmap(.l = list(data = list(LS5_training_labels_no_clouds),
               data_name = list("LANDSAT_5"),
               band = LS57_ee),
     .f = make_class_comp_plot)
There are 26 outliers in SR_B1, 7 in SR_B2, 8 in SR_B3, 20 in SR_B4, 42 in SR_B5, and 40 in SR_B7.
DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment
There are definitely some varying patterns here, let’s zoom in on the sediment classes.
DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment
Okay, this makes sense as to why there is no significant difference for many of the band-level “darkNearShoreSediment” comparisons - there’s a lot of overlap in ranges. Looking at these scatterplot matrices, I do think the patterns across multiple bands are likely different enough that ML should be able to pick up on subtle differences.
Things to note for Landsat 5:

- pixels with QA flags should be dismissed from model application
- pixels with SR_ATMOS_OPACITY > 0.3 should be dismissed from model application
write_rds(LS5_training_labels, paste0("data/labels/LS5_labels_for_tvt_", outlier_version, ".RDS"))